Abstract: Web robots are software programs that automatically traverse the hyperlink structure of the Web to retrieve Web resources. Robots can be used for a variety of tasks, such as crawling and indexing information for search engines, offline browsing, shopping comparison, and email address collection. Robots can also be used for malicious purposes, such as sending spam mail and stealing business intelligence. Detecting robots is necessary because of privacy, security, and server performance concerns. Several well-known techniques for detecting robots are: checking for requests to robots.txt, matching known robot IP addresses, User agent mapping, keyword matching in the User agent field, browsing speed, and unassigned referrers. In this paper we discuss and implement various robot identification techniques on real server log data and compare their performance on a given dataset.

Keywords: Robot detection, Web server log, Web usage mining, Data extraction.